Encoding method for the compression of a video sequence

ABSTRACT

The invention relates to an encoding method for the compression of a video sequence. Said method, using a three-dimensional wavelet transform, is based on a hierarchical subband coding process in which the subbands to be encoded are scanned in an order that preserves the initial subband structure of the 3D wavelet transform. According to the invention, a temporal (resp. spatial) scalability is obtained by performing a motion estimation at each temporal resolution level (resp. at the highest spatial resolution level), and only the part of the estimated motion vectors necessary to reconstruct any given temporal (resp. spatial) resolution level is then encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given temporal (resp. spatial) level, said insertion in the bitstream being done before encoding texture coefficients at the same temporal (resp. spatial) level. Such a solution avoids encoding and sending all the motion vector fields in the bitstream, which would be a drawback when a low bitrate is targeted and the receiver only wants a reduced frame rate or spatial resolution.

[0001] The present invention relates to an encoding method for the compression of a video sequence divided into groups of frames and decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said method being based on a hierarchical subband encoding process leading from the original set of picture elements (pixels) of each group of frames to transform coefficients constituting a hierarchical pyramid, a spatio-temporal orientation tree - in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels - defining the spatio-temporal relationship inside said hierarchical pyramid, the subbands to be encoded being scanned one after the other in an order that respects the parent-offspring dependencies formed in said tree and preserves the initial subband structure of the 3D wavelet transform.

[0002] Video streaming over heterogeneous networks requires a high degree of scalability. That means that parts of a bitstream can be decoded without a complete decoding of the sequence and can be combined to reconstruct the initial video information at lower spatial or temporal resolutions (spatial/temporal scalability) or with a lower quality (PSNR scalability). A convenient way to achieve all three types of scalability is a three-dimensional (3D) wavelet decomposition of the motion-compensated video sequence.

[0003] In a previous European patent application filed by the Applicant on May 3, 2000, with the number 00401216.7 (PHFR000044), a simple method of texture coding having this property has been described. In that method, as well as in other published documents (such as for instance, in “An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT)”, by B. Kim and W. A. Pearlman, Proceedings DCC'97, Data Compression Conference, Snowbird, Utah, U.S.A., Mar. 25-27, 1997, pp. 251-260), all the motion vector fields are encoded and sent in the bitstream, which may become a major drawback when a low bitrate is targeted and the receiver only wants a reduced frame rate or spatial resolution.

[0004] It is therefore an object of the invention to propose an encoding method better adapted to the situation where a high scalability must be obtained.

[0005] To this end, the invention relates to an encoding method such as defined in the introductory part of the description and which is moreover characterized in that, in view of a temporal scalability, a motion estimation is performed at each temporal resolution level, the beginning of which is indicated by flags inserted into the bitstream, and only the estimated motion vectors necessary to reconstruct any given temporal resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given temporal level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same temporal level.

[0006] In another embodiment, the invention also relates to an encoding method such as defined in said introductory part and which is characterized in that, in view of a spatial scalability, a motion estimation is performed at the highest spatial resolution level, the vectors thus obtained being divided by two in order to obtain the motion vectors for the lower spatial resolutions, and only the estimated motion vectors necessary to reconstruct any spatial resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given spatial level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same spatial level, and said encoding operation being carried out on the motion vectors at the lowest spatial resolution, only refinement bits at each spatial resolution being then put in the bitstream bitplane by bitplane, from one resolution level to the other.

[0007] The technical solution thus proposed makes it possible to encode only the motion vectors corresponding to the desired frame rate or spatial resolution, instead of sending all the motion vectors corresponding to all possible frame rates and all spatial resolution levels.

[0008] The present invention will now be described, by way of example, with reference to the accompanying drawings in which:

[0009] FIG. 1 illustrates a temporal subband decomposition of the video information with motion compensation using the Haar multiresolution analysis;

[0010] FIG. 2 shows the spatio-temporal subbands resulting from a three-dimensional wavelet decomposition;

[0011] FIG. 3 illustrates the motion vector insertion in the bitstream for temporal scalability;

[0012] FIG. 4 shows the structure of the bitstream obtained with a temporally driven scanning of the spatio-temporal tree;

[0013] FIG. 5 is a binary representation of a motion vector and its progressive transmission from the lowest resolution to the highest;

[0014] FIG. 6 shows the bitstream organization for motion vector coding in the proposed scalable approach.

[0015] A temporal subband decomposition of a video sequence is shown in FIG. 1. The illustrated 3D wavelet decomposition with motion compensation is applied to a group of frames (GOF), referenced F1 to F8. In this 3D subband decomposition scheme, each GOF of the input video is first motion-compensated (MC in FIG. 1), a step that makes it possible to process sequences with large motion, and then temporally filtered using Haar wavelets (the dotted arrows correspond to a high-pass temporal filtering, while the other ones correspond to a low-pass temporal filtering). After these two operations, each temporal subband is spatially decomposed into spatio-temporal subbands, which leads to a 3D wavelet representation of the original GOF, as illustrated in FIG. 2. In FIG. 1, three stages of decomposition are shown (L and H = first stage; LL and LH = second stage; LLL and LLH = third stage). At each temporal decomposition level of the illustrated group of 8 frames, a group of motion vector fields is generated (MV4 at the first level, MV3 at the second one, MV2 at the third one). When a Haar multiresolution analysis is used for the temporal decomposition, one motion vector field is generated between every two frames in the considered group of frames at each temporal decomposition level, so the number of motion vector fields is equal to half the number of frames in the temporal subband, i.e. four at the first level of motion vector fields, two at the second one, and one at the third one. At the decoder side, in order to reconstruct a given temporal level, only the motion vector fields at that level and at the lower temporal resolutions (reduced frame rate) are needed.
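The field counts given above can be checked with a short sketch. It is only an illustration of the arithmetic for a Haar temporal decomposition, not the encoder itself; the function names and the convention that temporal resolution 1 is the lowest frame rate are assumptions made for this example.

```python
def motion_fields_per_level(gof_size=8, levels=3):
    """Motion vector fields generated at each temporal decomposition level.

    With Haar filtering, one field is estimated per pair of frames, so each
    level produces half as many fields as it has input frames.
    """
    counts = []
    frames = gof_size
    for _ in range(levels):
        if frames < 2 or frames % 2:
            raise ValueError("GOF size must be divisible by 2**levels")
        counts.append(frames // 2)  # one field per pair of frames
        frames //= 2                # the low-pass frames feed the next level
    return counts


def fields_needed(resolution, gof_size=8, levels=3):
    """Fields a decoder needs for temporal resolution 1..levels+1 (1 = lowest)."""
    if not 1 <= resolution <= levels + 1:
        raise ValueError("resolution out of range")
    # Reverse so that the coarsest level (a single field) comes first.
    coarse_first = motion_fields_per_level(gof_size, levels)[::-1]
    return sum(coarse_first[:resolution - 1])
```

For the GOF of 8 frames in FIG. 1, `motion_fields_per_level()` gives four, two and one fields, and a decoder that only wants the lowest frame rate needs none of them.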

[0016] (A) Temporal Scalability

[0017] This observation leads, according to the invention, to organizing the bitstream in a way that allows a progressive decoding, as described for example in FIG. 3: three temporal decomposition levels TDL (as shown in FIG. 1) yield four temporal resolution levels (1 to 4), which represent the possible frame rates that can be obtained from the initial frame rate. The coefficients corresponding to the lowest temporal resolution level are encoded first, without sending motion vectors at this level, and, for all the other reconstruction frame rates, the motion vector fields and the frames of the corresponding high-frequency temporal subband are encoded. This description of the bitstream organization so far only takes into account the temporal levels. However, for a complete scalability, one has to consider the spatial scalability inside each temporal level. The solution for wavelet coefficients was described in the European patent application already cited, and it is recalled in FIG. 4: inside each temporal scale, all the spatial resolutions are successively scanned (SDL = spatial decomposition levels), and therefore all the spatial frequencies are available (frame rates t=1 to 4; display sizes s=1 to 4). The upper flags separate two bitplanes, and the lower ones two temporal decomposition levels.
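The ordering described in this paragraph can be sketched as a list of labelled bitstream segments. This is a simplified illustration under stated assumptions: the segment labels `"MV"` and `"TEX"` and both function names are hypothetical, and the bitplane loop and the flags of FIG. 4 are left out.

```python
def temporal_scan_order(temporal_levels=4, spatial_levels=4):
    """Order of segments for a temporally driven scan (bitplanes omitted)."""
    segments = []
    for t in range(1, temporal_levels + 1):
        if t > 1:
            # No motion vectors are sent for the lowest temporal resolution;
            # for the others they precede the texture of the same level.
            segments.append(("MV", t))
        for s in range(1, spatial_levels + 1):
            segments.append(("TEX", t, s))
    return segments


def segments_for(order, target_t):
    """Segments a decoder keeps to reconstruct temporal resolution target_t."""
    return [seg for seg in order if seg[1] <= target_t]
```

A receiver that only wants temporal resolution 2 would then keep the texture of levels 1 and 2 and the single group of motion vector fields of level 2, and skip everything after it.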

[0018] (B) Spatial Scalability

[0019] In order to be able to reconstruct a reduced spatial resolution video, it is not desirable to transmit at the beginning of the bitstream the motion vector fields of full resolution. Indeed, it is necessary to adapt the motion described by the motion vectors to the size of the current spatial level. Ideally, it would be desirable to have first a low resolution motion vector field corresponding to the lowest spatial resolution and then to be able to progressively increase the resolution of the motion vectors according to the increase in the spatial resolution. Only the difference from a motion vector field resolution to another one would be encoded and transmitted.

[0020] It will be assumed that the motion estimation is performed by means of a block-based method like full search block matching or any other derived solution, with an integer pixel precision on full resolution frames (this hypothesis does not reduce the generality of the problem: if one wants to work with half-pixel precision for motion vectors, by multiplying all the motion vectors by 2 at the beginning, one returns to the previous case of integer vectors, even though they will represent fractional displacements). Thus, motion vectors are represented by integers. Given the full resolution motion vector field, in order to satisfy the above requirements of spatial scalability, the motion vector resolution is reduced by a simple divide-by-2 operation. Indeed, as the spatial resolution of the approximation subband is reduced by a factor 2, while the motion is the same as in the full resolution subband, the displacements will be reduced by a factor 2. This division is implemented for integers by a simple shift.
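The divide-by-2 operation can be sketched with an integer shift. The function names are hypothetical, and the handling of negative components is an assumption: the text does not specify it, and Python's `>>` on negative integers rounds toward minus infinity.

```python
def to_half_pel_units(mv):
    """Express a half-pixel-precision vector as integers by multiplying by 2."""
    dx, dy = mv
    return (int(round(2 * dx)), int(round(2 * dy)))


def scale_to_level(mv, level):
    """Motion vector after `level` spatial halvings, using one shift per halving.

    Assumption: negative components use Python's arithmetic shift, which
    floors the result (for example -5 >> 1 gives -3).
    """
    dx, dy = mv
    return (dx >> level, dy >> level)
```

For example, the full resolution vector `(12, -5)` becomes `(6, -3)` at half resolution and `(3, -2)` at quarter resolution under this rounding convention.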

[0021] The size of the blocks in the motion estimation must be chosen carefully: indeed, if the original size of the block is 8×8 in the full resolution, it will become 4×4 in the half resolution, then 2×2 in the quarter, and so on. A problem will therefore appear if the original size of the blocks is too small: the size can be null for small spatial resolutions. Thus it must be checked that the original size is compatible with the number of decomposition/reconstruction levels.
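This compatibility check can be written in a few lines. The sketch below assumes square blocks and uses a stricter criterion than the text requires, namely that the block size divides evenly at every level; both function names are hypothetical.

```python
def block_size_at_level(block_size, level):
    """Block side length after `level` spatial halvings (integer shift)."""
    return block_size >> level


def is_block_size_compatible(block_size, spatial_levels):
    """True if the block keeps a whole, non-null size down to the lowest level.

    Assumption: the size must be a multiple of 2**spatial_levels, which also
    guarantees that it never becomes null.
    """
    return block_size > 0 and block_size % (1 << spatial_levels) == 0
```

With 8×8 blocks, three spatial halvings leave 1×1 blocks, while a fourth halving would give a null size, so `is_block_size_compatible(8, 4)` is false.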

[0022] It is now assumed that one has S spatial decomposition levels and that one wants the motion vectors corresponding to all possible resolutions, from the lowest to the highest. Then, the initial motion vectors are divided by 2^S or, equivalently, a shift of S positions is performed. The result represents the motion vectors corresponding to the blocks of the lowest resolution, whose size is divided by 2^S. A division by 2^(S-1) of the original motion vector would provide the next spatial resolution. But this value is already available from the previous operation. Indeed, it corresponds to a shift of S-1 positions. The difference from the first operation is the bit in the binary representation of the motion vector with a weight of 2^(S-1). It is then sufficient to add this bit (the refinement bit) to the previously transmitted vector to reconstruct the motion vector at a higher resolution, which is illustrated in FIG. 5 for S=4. This progressive transmission of the motion vectors makes it possible to include in the bitstream the refinement bits of the motion vector fields from one spatial resolution to another just before the bits corresponding to the texture at the same spatial level. The proposed method is summarized in FIG. 6.
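The progressive transmission described above can be sketched for a single motion vector component. The function names and the example value 45 are assumptions chosen for illustration; they are not taken from FIG. 5. Negative values rely on Python's two's-complement behaviour for `>>`, `&` and `|`, which the text does not specify.

```python
def split_vector_component(value, levels):
    """Split a component into its lowest-resolution value and refinement bits.

    The base is the component shifted by `levels` positions; the refinement
    bits are those of weight 2**(levels-1) down to 2**0, coarsest first.
    """
    base = value >> levels
    bits = [(value >> k) & 1 for k in range(levels - 1, -1, -1)]
    return base, bits


def refine(base, bits):
    """Component at each successively finer resolution, lowest first."""
    current = base
    resolutions = [current]
    for bit in bits:
        # Appending one refinement bit doubles the precision of the vector.
        current = (current << 1) | bit
        resolutions.append(current)
    return resolutions
```

With S=4 and a full resolution component of 45 (binary 101101), the base is 2 and the refinement bits are 1, 1, 0, 1, so a decoder reconstructs 2, 5, 11, 22 and finally 45 as it moves up the spatial levels.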

[0023] The motion vectors at the lowest resolution are encoded with a DPCM technique followed by entropy coding using usual VLC tables (e.g., those used in MPEG-4). For the other resolution levels, a complete bitplane composed of the refinement bits of the motion vector field has to be encoded, for instance by means of a contextual arithmetic encoding, with the context depending on the horizontal or vertical component of the motion vector.
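The DPCM step can be illustrated as follows. The sketch assumes the simplest possible predictor, the previous value of the same component in scan order; the text does not specify the predictor, and the VLC and arithmetic coding stages are not shown. Both function names are hypothetical.

```python
def dpcm_encode(values):
    """Replace each value by its difference from the previous one.

    Assumption: the predictor is the previous value in scan order,
    starting from 0.
    """
    residuals = []
    previous = 0
    for value in values:
        residuals.append(value - previous)
        previous = value
    return residuals


def dpcm_decode(residuals):
    """Invert dpcm_encode by accumulating the residuals."""
    values = []
    previous = 0
    for residual in residuals:
        previous += residual
        values.append(previous)
    return values
```

Neighbouring blocks often move similarly, so the residuals tend to be small, which is what makes the subsequent entropy coding effective.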

[0024] The part of the bitstream representing motion vectors precedes any information concerning the texture. The difference with respect to a “classical” non-scalable approach is that the hierarchy of temporal and spatial levels is transposed to the motion vector coding. The most significant improvement with respect to the previous technique is that the motion information can be decoded progressively. For a given spatial resolution, the decoder does not have to decode parts of the bitstream that are not useful at that level.

1. An encoding method for the compression of a video sequence divided into groups of frames and decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said method being based on a hierarchical subband encoding process leading from the original set of picture elements (pixels) of each group of frames to transform coefficients constituting a hierarchical pyramid, a spatio-temporal orientation tree - in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels - defining the spatio-temporal relationship inside said hierarchical pyramid, the subbands to be encoded being scanned one after the other in an order that respects the parent-offspring dependencies formed in said tree and preserves the initial subband structure of the 3D wavelet transform, said method being further characterized in that, in view of a temporal scalability, a motion estimation is performed at each temporal resolution level, the beginning of which is indicated by flags inserted into the bitstream, and only the estimated motion vectors necessary to reconstruct any given temporal resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given temporal level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same temporal level.
2. An encoding method for the compression of a video sequence divided into groups of frames and decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said method being based on a hierarchical subband encoding process leading from the original set of picture elements (pixels) of each group of frames to transform coefficients constituting a hierarchical pyramid, a spatio-temporal orientation tree - in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels - defining the spatio-temporal relationship inside said hierarchical pyramid, the subbands to be encoded being scanned one after the other in an order that respects the parent-offspring dependencies formed in said tree and preserves the initial subband structure of the 3D wavelet transform, said method being further characterized in that, in view of a spatial scalability, a motion estimation is performed at the highest spatial resolution level, the vectors thus obtained being divided by two in order to obtain the motion vectors for the lower spatial resolutions, and only the estimated motion vectors necessary to reconstruct any spatial resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given spatial level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same spatial level, and said encoding operation being carried out on the motion vectors at the lowest spatial resolution, only refinement bits at each spatial resolution being then put in the bitstream bitplane by bitplane, from one resolution level to the other.